The trainer package source is inside the cifar10 directory. It is based on TensorFlow's CNN tutorial and one of the Datalab image classification examples.
We need to enable the ML Engine API, since it isn't enabled by default.
In [ ]:
%%bash
cd cifar10
# Clean old builds
rm -rf build dist
# Build wheel distribution
python setup.py bdist_wheel --universal
# Check the built package
ls -al dist
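For reference, the wheel build above only needs a minimal `setup.py` in the cifar10 directory. The sketch below is an assumption, not the repo's actual file; the name and version are chosen to match the `dist/trainer-0.0.0-py2.py3-none-any.whl` filename used when submitting the job:

```python
# setup.py -- minimal packaging sketch (assumed, not the repo's actual file).
# name/version match the wheel filename the notebook expects
# (trainer-0.0.0-py2.py3-none-any.whl), and find_packages() picks up
# the trainer/ package so ML Engine can run trainer.task.
from setuptools import find_packages, setup

setup(
    name='trainer',
    version='0.0.0',
    packages=find_packages(),
)
```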
In [ ]:
%%bash
cd cifar10
# Set some variables
JOB_NAME=cifar10_train_$(date +%s)
BUCKET_NAME=dost_deeplearning_cifar10 # Change this to your own!
TRAINING_PACKAGE_PATH=dist/trainer-0.0.0-py2.py3-none-any.whl
# Submit the job through the gcloud tool
gcloud ml-engine jobs submit training \
$JOB_NAME \
--region us-east1 \
--job-dir gs://$BUCKET_NAME/$JOB_NAME \
--packages $TRAINING_PACKAGE_PATH \
--module-name trainer.task \
--config config.yaml
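The `--config config.yaml` flag points at a YAML file of training settings. The repo's actual file isn't shown here; a typical sketch that requests a single-GPU machine (the `scaleTier` and `masterType` values are assumptions, pick whatever fits your budget) looks like:

```yaml
# config.yaml -- hypothetical ML Engine training configuration.
trainingInput:
  scaleTier: CUSTOM        # use a custom machine instead of a preset tier
  masterType: standard_gpu # one NVIDIA GPU on the master worker
```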
It will take a few minutes for ML Engine to provision a training instance for our job. While that's happening, let's talk about pricing!
In [ ]:
import os.path
from google.datalab.ml import TensorBoard
bucket_path = 'gs://dost_deeplearning_cifar10' # Change this to your own bucket
job_name = 'cifar10_train_1499874404' # Change this to your own job name
train_dir = os.path.join(bucket_path, job_name, 'train')
TensorBoard.start(train_dir)
Training will finish in around 8-9 hours. Make sure your training job is running properly before you go!
We will deploy our trained model tomorrow and integrate it with a web app to run predictions on arbitrary images.
Ideally, you'd want to evaluate your model every X steps while training to get a log of your accuracy values.
There's an eval.py module in the trainer package that's a slightly modified copy of cifar10_eval.py from the TensorFlow CIFAR-10 tutorial. We're not using it yet, though. Try adding this evaluation step and re-running your training job. Don't stop your previous training job!
TIP: You can add the evaluation step as a hook in our MonitoredTrainingSession. Take a look at _LoggerHook for an example.
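The hook pattern can be sketched in plain Python. This is TensorFlow-free so the control flow is easy to see; the names `EvalEveryNStepsHook` and `eval_fn` are illustrative, and in the real trainer you would subclass `tf.train.SessionRunHook` (the way `_LoggerHook` does) and pass the hook to `MonitoredTrainingSession`:

```python
class EvalEveryNStepsHook:
    """Sketch of a hook that triggers evaluation every N training steps.

    In the real trainer this would subclass tf.train.SessionRunHook and
    read the global step from run_values; here we simply count calls.
    """

    def __init__(self, every_n_steps, eval_fn):
        self._every_n_steps = every_n_steps
        self._eval_fn = eval_fn  # e.g. a wrapper around trainer's eval.py
        self._step = 0

    def after_run(self):
        """Called once per training step by the monitored session loop."""
        self._step += 1
        if self._step % self._every_n_steps == 0:
            self._eval_fn(self._step)


# Toy usage: record which steps triggered an evaluation.
evaluated = []
hook = EvalEveryNStepsHook(100, evaluated.append)
for _ in range(350):
    hook.after_run()
print(evaluated)  # steps 100, 200 and 300 trigger eval
```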